Center - mean, median, mode
Spread - range, variance, standard deviation
Our last graphs
To understand hypothesis testing need to understand standard normal distribution
Recall - sculpin in Toolik Lake
n = 208
mean = 51.69 mm
std dev = s = 12.02 mm
Length distribution ~normal
You want to know things about this population like
Standard Normal Distribution
~68% of the curve area within +/- 1 σ of the mean,
~95% within +/- 2 σ of the mean,
~99.7% within +/- 3 σ of the mean
*remember σ = standard deviation
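The three areas above can be checked numerically. A minimal sketch, assuming Python's standard-library `statistics.NormalDist` as the standard normal curve (the helper name `area_within` is illustrative, not from the lecture):

```python
from statistics import NormalDist

snd = NormalDist(mu=0.0, sigma=1.0)  # standard normal: mean 0, sd 1

def area_within(k):
    """Area under the standard normal curve between -k and +k sd."""
    return snd.cdf(k) - snd.cdf(-k)

for k in (1, 2, 3):
    print(f"within +/- {k} sd of the mean: {area_within(k):.4f}")
```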
Areas under curve of Standard Normal Distribution
Done by converting original data points to z-scores
Thus:
Area under curve (probability) of standard normal distribution is known relative to z-values
Knowing z-value, can figure out corresponding area under the curve
What is the area under the curve for z < 0?
Here is z-score table for right side or positive values of the z distribution (z > 0)
Numbers give area under the curve to left of a particular z-score
say 60 mm has a z-score of about 0.69
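A minimal sketch of the conversion, assuming the sculpin sample mean and standard deviation quoted earlier are used as the center and spread (the function name `z_score` is illustrative). The result should land near the value quoted above; small differences come from rounding the mean and standard deviation.

```python
from statistics import NormalDist

MEAN_MM = 51.69  # sample mean from the sculpin data
SD_MM = 12.02    # sample standard deviation from the sculpin data

def z_score(y, mean, sd):
    """Number of standard deviations that y lies above the mean."""
    return (y - mean) / sd

z = z_score(60.0, MEAN_MM, SD_MM)
area_left = NormalDist().cdf(z)  # area under the curve to the left of z

print(f"z = {z:.3f}, area to the left = {area_left:.3f}")
```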
What area of the curve is contained to the left of z = 1.22?
What area of the curve is contained between z = 0 and z = 1.5?
To find the area under the standard normal curve between 0 and 1.5 using a standard normal table, subtract the two cumulative areas:
P(0 ≤ Z ≤ 1.5) = P(Z ≤ 1.5) - P(Z ≤ 0) = 0.9332 - 0.5000 = 0.4332
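A minimal sketch of this lookup, assuming `statistics.NormalDist` stands in for the printed table (the helper name `area_between` is illustrative). The area between two z-values is the cumulative area at the upper value minus the cumulative area at the lower value.

```python
from statistics import NormalDist

snd = NormalDist()  # standard normal distribution

def area_between(z_low, z_high):
    """Area under the standard normal curve between z_low and z_high."""
    return snd.cdf(z_high) - snd.cdf(z_low)

print(f"area between z = 0 and z = 1.5: {area_between(0.0, 1.5):.4f}")
```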
What area of the curve is contained to the left of z = -1?
The table only covers z > 0, so need to use the symmetry property of the standard normal distribution:
P(Z ≤ -1.0) = 1 - P(Z ≤ 1.0) = 1 - 0.8413 = 0.1587
Therefore, 15.87% of area falls to the left of z = -1.0
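The symmetry step can be sketched in code. This is an illustration only: `table_lookup` is a hypothetical stand-in for a printed table that lists only z ≥ 0, and `area_left` applies the symmetry rule for negative z.

```python
from statistics import NormalDist

def table_lookup(z):
    """Stand-in for a printed z-table that only lists z >= 0."""
    if z < 0:
        raise ValueError("table only covers z >= 0")
    return NormalDist().cdf(z)

def area_left(z):
    """Area to the left of z, using symmetry when z is negative."""
    if z >= 0:
        return table_lookup(z)
    return 1.0 - table_lookup(-z)  # P(Z <= -a) = 1 - P(Z <= a)

print(f"area left of z = -1.0: {area_left(-1.0):.4f}")
```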
Take random samples from fish population:
3 random samples (each n=20) from population:
Notice the sample statistics and distributions
Every sample gives slightly different estimate of µ
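A small simulation shows the same behavior. This is a sketch under stated assumptions: the population is treated as exactly normal with the sculpin mean and standard deviation, and the seed is arbitrary. It does not use the actual fish data.

```python
import random
from statistics import mean, stdev

POP_MEAN = 51.69  # assumed population mean (mm)
POP_SD = 12.02    # assumed population standard deviation (mm)

rng = random.Random(1)  # fixed seed so the run is repeatable

def draw_sample(n):
    """Draw n lengths from the assumed normal population."""
    return [rng.gauss(POP_MEAN, POP_SD) for _ in range(n)]

samples = [draw_sample(20) for _ in range(3)]
for i, s in enumerate(samples, start=1):
    print(f"sample {i}: mean = {mean(s):.2f} mm, sd = {stdev(s):.2f} mm")
```

Each run of `draw_sample` gives a different sample mean, even though all three samples come from the same population.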
Given the above, can estimate the standard deviation of the sample means
“Standard error of sample mean”
How good is your estimate of population mean? (based on the sample collected)
quantifies how much the sample means are expected to vary from sample to sample
gives an estimate of the error associated with using \(\bar{y}\) to estimate \(\mu\)…
\(\sigma_{\bar{y}} = \frac{\sigma}{\sqrt{n}}\)
but rarely know σ, so use s:
\(s_{\bar{y}} = \frac{s}{\sqrt{n}}\)
Where:
- \(s_{\bar{y}}\) = sample standard error of the mean
- s = sample standard deviation
- n = sample size
Notice: \(s_{\bar{y}}\) depends on
- sample s (standard deviation)
- sample n
How and why?
- Decreases with sample n (number)
- Increases with sample s (standard deviation)
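A minimal sketch of the standard error calculation and how it responds to n and s. The function name `standard_error` and the sample sizes tried are illustrative; s = 12.02 is the sculpin value from earlier.

```python
import math

def standard_error(s, n):
    """Sample standard error of the mean: s / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return s / math.sqrt(n)

# Same s, increasing n: the standard error shrinks.
for n in (20, 80, 208):
    print(f"s = 12.02, n = {n:3d}: SE = {standard_error(12.02, n):.3f}")

# Same n, doubled s: the standard error doubles.
print(f"s = 24.04, n =  20: SE = {standard_error(24.04, 20):.3f}")
```

Because n sits under a square root, quadrupling the sample size (20 to 80) only halves the standard error.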
Every sample gives slightly different estimate of µ (population mean)
Want to know how accurate our estimate of µ is from a sample
Do this by calculating confidence interval:
Often calculate 95% CIs
Formula for confidence interval
\(\text{95% CI} = \bar{y} \pm z \cdot \frac{\sigma}{\sqrt{n}}\)
95% of the probability of the standard normal distribution (SND) lies between z = -1.96 and z = 1.96
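A sketch of the z-based interval. Loud assumption: the sculpin sample standard deviation (12.02 mm) is treated here as if it were the known population σ, purely to have numbers to plug in. The function name `z_confidence_interval` is illustrative.

```python
import math
from statistics import NormalDist

def z_confidence_interval(ybar, sigma, n, level=0.95):
    """CI for the mean when the population sigma is known."""
    # For level = 0.95 this is the z with 97.5% of the area to its left.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    margin = z * sigma / math.sqrt(n)
    return ybar - margin, ybar + margin

# Assumption: sigma = 12.02 is treated as known for illustration only.
low, high = z_confidence_interval(51.69, 12.02, 208)
print(f"95% CI: {low:.2f} to {high:.2f} mm")
```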
So for:
In the more typical case, DON’T know the population σ - estimate it from the sample s
When don’t know the population σ (and when sample size is < ~30), can’t use the standard normal (z) distribution
Instead, we use Student’s t distribution
Student’s t distribution similar to SND
At df = ~30 - t distribution becomes close to z distribution
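To see how quickly t approaches z, the sketch below compares two-tailed 95% critical values. The t values are hard-coded from a standard printed t-table rather than computed, because the Python standard library has no t distribution; the selection of df values is arbitrary.

```python
from statistics import NormalDist

# Two-tailed 95% critical t-values, copied from a standard t-table.
T_CRIT_95 = {5: 2.571, 10: 2.228, 20: 2.086, 30: 2.042, 100: 1.984}

z_crit = NormalDist().inv_cdf(0.975)  # two-tailed 95% critical z

for df, t_crit in T_CRIT_95.items():
    print(f"df = {df:3d}: t = {t_crit:.3f}, t - z = {t_crit - z_crit:.3f}")
```

The gap between t and z is large at small df and shrinks steadily as df grows.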
To calculate CI for sample from “unknown” population:
\(\text{CI} = \bar{y} \pm t \cdot \frac{s}{\sqrt{n}}\)
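Where:
- \(\bar{y}\) = sample mean
- t = critical value of the t distribution for the chosen confidence level, with df = n - 1
- s = sample standard deviation
- n = sample size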
Here is a t-table
One-tailed questions: area of distribution left (or right) of a certain value
Two-tailed questions refer to area between certain values
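The difference shows up in the critical values. A minimal sketch, using the standard normal distribution as a stand-in for the t distribution (an assumption made only because the standard library has no t distribution), with α = 0.05:

```python
from statistics import NormalDist

snd = NormalDist()
alpha = 0.05

# One-tailed: all of alpha sits in a single tail.
one_tailed_crit = snd.inv_cdf(1 - alpha)
# Two-tailed: alpha is split evenly between the two tails.
two_tailed_crit = snd.inv_cdf(1 - alpha / 2)

area_right = 1 - snd.cdf(one_tailed_crit)
area_between = snd.cdf(two_tailed_crit) - snd.cdf(-two_tailed_crit)

print(f"one-tailed critical z = {one_tailed_crit:.3f}, area to the right = {area_right:.3f}")
print(f"two-tailed critical z = {two_tailed_crit:.3f}, area between = {area_between:.3f}")
```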
Let’s calculate CIs again:
Use two-sided test
So:
For example
How would you assess this question using what we learned?
Let’s calculate the 95% CI for population X
Use two-sided test
95% CI for Sample X: 54 ± 1.984 * (10.9/(132^0.5)) = 54 ± 1.882267
The 95% CI is between 52.12 and 55.88
Notice: the 95% confidence interval (52.12 to 55.88) does not contain 51.7
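The arithmetic for Sample X can be reproduced with a short sketch. The inputs (mean 54, s = 10.9, n = 132, t = 1.984) are the ones used above; the function name `t_confidence_interval` is illustrative, and the critical t-value is passed in by hand rather than computed.

```python
import math

def t_confidence_interval(ybar, s, n, t_crit):
    """CI for the mean using a t critical value read from a table."""
    margin = t_crit * s / math.sqrt(n)
    return ybar - margin, ybar + margin

low, high = t_confidence_interval(54.0, 10.9, 132, t_crit=1.984)
print(f"95% CI: {low:.2f} to {high:.2f}")

# Does the interval include 51.7?
print(low <= 51.7 <= high)
```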
Major goal of statistics:
inferences about populations from samples
assign degree of confidence to inferences
Statistical H-testing:
formalized approach to inference
Relies on specifying null hypothesis (Ho) and alternate hypothesis (Ha)
Which p-value suggests Ho likely false?
At what point reject Ho?
p < 0.05 conventional “significance threshold” (α)
p < 0.05 means:
α is the rate at which we will reject a true null hypothesis (Type I error rate)
Lowering α will lower likelihood of incorrectly rejecting a true null hypothesis (e.g., 0.01, 0.001)
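The link between α and the Type I error rate can be illustrated by simulation. This is a sketch under stated assumptions, not the lecture's method: Ho is true by construction, the population is normal with a known σ (so a z-test applies), and the sample size, trial count, and seed are arbitrary choices.

```python
import math
import random
from statistics import NormalDist, fmean

MU0 = 51.69    # mean under Ho; also the true mean here, so Ho is true
SIGMA = 12.02  # assumed known population standard deviation
N = 20         # assumed sample size

def type_i_error_rate(alpha, trials=5000, seed=42):
    """Fraction of samples from a true-Ho population that reject Ho."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed cutoff
    se = SIGMA / math.sqrt(N)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(MU0, SIGMA) for _ in range(N)]
        z = (fmean(sample) - MU0) / se
        if abs(z) > z_crit:
            rejections += 1  # Ho is true, so this is a Type I error
    return rejections / trials

for alpha in (0.05, 0.01):
    print(f"alpha = {alpha}: observed Type I rate = {type_i_error_rate(alpha):.3f}")
```

The observed rejection rate should sit close to the chosen α, and it drops when α is lowered.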
Both hypotheses and α are specified BEFORE collection of data and analysis
Traditionally α=0.05 is used as a cut off for rejecting null hypothesis
Nothing magical about 0.05 - actual p-values need to be reported.
| p-value range | Interpretation |
|---|---|
| P > 0.10 | No evidence against Ho - data appear consistent with Ho |
| 0.05 < P < 0.10 | Weak evidence against Ho in favor of Ha |
| 0.01 < P < 0.05 | Moderate evidence against Ho in favor of Ha |
| 0.001 < P < 0.01 | Strong evidence against Ho in favor of Ha |
| P < 0.001 | Very strong evidence against Ho in favor of Ha |
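The table can be written as a small lookup. The function name `evidence_against_h0` is illustrative. The table uses strict inequalities and does not say where values exactly on a boundary belong, so placing p = 0.05 in the "moderate" band (and likewise for the other boundaries) is an assumption of this sketch.

```python
def evidence_against_h0(p):
    """Map a p-value to the verbal interpretation in the table above."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    # Boundary handling is an assumption: each band includes its upper limit.
    if p > 0.10:
        return "No evidence against Ho"
    if p > 0.05:
        return "Weak evidence against Ho"
    if p > 0.01:
        return "Moderate evidence against Ho"
    if p > 0.001:
        return "Strong evidence against Ho"
    return "Very strong evidence against Ho"

for p in (0.32, 0.07, 0.03, 0.004, 0.0002):
    print(f"p = {p}: {evidence_against_h0(p)}")
```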
Fisher:
p-value as informal measure of discrepancy between data and Ho
“If p is between 0.1 and 0.9 there is certainly no reason to suspect the hypothesis tested. If it is below 0.02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at .05 …”
General procedure for H testing: